InternData-A1: Pioneering High-Fidelity Synthetic Data for Pre-training Generalist Policy

Tian, Yang, Yang, Yuyin, Xie, Yiman, Cai, Zetao, Shi, Xu, Gao, Ning, Liu, Hangxu, Jiang, Xuekun, Qiu, Zherui, Yuan, Feng, Li, Yaping, Wang, Ping, Cai, Junhao, Zeng, Jia, Dong, Hao, Pang, Jiangmiao

arXiv.org Artificial Intelligence

Recent works explore how real and synthetic data contribute to Vision-Language-Action (VLA) models' generalization. While current VLA models have shown the strong effectiveness of large-scale real-robot pre-training, synthetic data has not previously demonstrated comparable capability at scale. This paper provides the first evidence that synthetic data alone can match the performance of the strongest $π$-dataset in pre-training a VLA model, revealing the substantial value of large-scale simulation. The resulting model also exhibits surprisingly strong zero-shot sim-to-real transfer on several challenging tasks. Our synthetic dataset, InternData-A1, contains over 630k trajectories and 7,433 hours across 4 embodiments, 18 skills, 70 tasks, and 227 scenes, covering rigid, articulated, deformable, and fluid-object manipulation. It is generated through a highly autonomous, fully decoupled, and compositional simulation pipeline that enables long-horizon skill composition, flexible task assembly, and heterogeneous embodiments with minimal manual tuning. Using the same architecture as $π_0$, we pre-train a model entirely on InternData-A1 and find that it matches the official $π_0$ across 49 simulation tasks, 5 real-world tasks, and 4 long-horizon dexterous tasks. We release the dataset and will open-source the generation pipeline to broaden access to large-scale robotic data and to lower the barrier to scalable data creation for embodied AI research.
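The compositional idea behind the pipeline (a small set of skill primitives chained into many long-horizon tasks across scenes) can be sketched as follows. The skill names, scene names, and the task dictionary are illustrative assumptions, not InternData-A1's actual generation API:

```python
import itertools
import random

# Hypothetical skill primitives and scenes; the real pipeline covers
# 18 skills, 70 tasks, and 227 scenes across 4 embodiments.
SKILLS = ["reach", "grasp", "transport", "place"]
SCENES = ["kitchen", "office", "lab"]

def assemble_task(skill_chain, scene, rng):
    """Compose one long-horizon task from primitive skills in a scene."""
    # Each skill contributes a trajectory segment; segments concatenate,
    # so new tasks are simply new chains over the same primitives.
    return {
        "scene": scene,
        "skills": list(skill_chain),
        "length": sum(rng.randint(50, 200) for _ in skill_chain),  # timesteps
    }

rng = random.Random(0)
dataset = [
    assemble_task(chain, scene, rng)
    for chain in itertools.permutations(SKILLS, 3)  # 4P3 = 24 chains
    for scene in SCENES
]
print(len(dataset))  # 24 chains x 3 scenes = 72 composed tasks
```

Because skills, scenes, and embodiments are decoupled, the number of distinct tasks grows combinatorially while each axis is tuned only once.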


XR-1: Towards Versatile Vision-Language-Action Models via Learning Unified Vision-Motion Representations

Fan, Shichao, Wu, Kun, Che, Zhengping, Wang, Xinhua, Wu, Di, Liao, Fei, Liu, Ning, Zhang, Yixue, Zhao, Zhen, Xu, Zhiyuan, Li, Meng, Liu, Qingjie, Zhang, Shanghang, Wan, Min, Tang, Jian

arXiv.org Artificial Intelligence

Recent progress in large-scale robotic datasets and vision-language models (VLMs) has advanced research on vision-language-action (VLA) models. However, existing VLA models still face two fundamental challenges: (i) producing precise low-level actions from high-dimensional observations, and (ii) bridging domain gaps across heterogeneous data sources, including diverse robot embodiments and human demonstrations. Existing methods often encode latent variables from either visual dynamics or robotic actions to guide policy learning, but they fail to fully exploit the complementary multi-modal knowledge present in large-scale, heterogeneous datasets. In this work, we present X Robotic Model 1 (XR-1), a novel framework for versatile and scalable VLA learning across diverse robots, tasks, and environments. XR-1 introduces the \emph{Unified Vision-Motion Codes (UVMC)}, a discrete latent representation learned via a dual-branch VQ-VAE that jointly encodes visual dynamics and robotic motion. UVMC addresses these challenges by (i) serving as an intermediate representation between the observations and actions, and (ii) aligning multimodal dynamic information from heterogeneous data sources to capture complementary knowledge. To effectively exploit UVMC, we propose a three-stage training paradigm: (i) self-supervised UVMC learning, (ii) UVMC-guided pretraining on large-scale cross-embodiment robotic datasets, and (iii) task-specific post-training. We validate XR-1 through extensive real-world experiments with more than 14,000 rollouts on six different robot embodiments, spanning over 120 diverse manipulation tasks. XR-1 consistently outperforms state-of-the-art baselines such as $π_{0.5}$, $π_0$, RDT, UniVLA, and GR00T-N1.5 while demonstrating strong generalization to novel objects, background variations, distractors, and illumination changes. Our project is at https://xr-1-vla.github.io/.
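The core mechanism of a VQ-VAE-style discrete latent code such as UVMC is nearest-neighbor quantization against a learned codebook. A minimal sketch, with codebook size, latent dimension, and data all being illustrative assumptions rather than XR-1's actual implementation:

```python
import random

rng = random.Random(0)
K, D = 16, 4                                    # codebook size, latent dim
# A (randomly initialized) codebook; in training these entries are learned.
codebook = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]

def quantize(z):
    """Return (token id, quantized vector) for the nearest codebook entry."""
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
    idx = dists.index(min(dists))               # discrete token id
    return idx, codebook[idx]                   # straight-through in training

latent = [rng.gauss(0, 1) for _ in range(D)]    # one encoder output
idx, zq = quantize(latent)
assert zq == codebook[idx]
```

In a dual-branch setup, one encoder produces latents from visual dynamics and another from robot motion, and both are snapped to shared discrete codes, which is what lets heterogeneous data sources align in one token space.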


13 yoga positions to do every day for increased flexibility

Popular Science

Flexibility is an essential part of staying fit. Breakthroughs, discoveries, and DIY tips sent every weekday. In your efforts to exercise, chances are you've worked on improving the four components of physical fitness. The problem is there are actually five. Criminally overlooked in the pursuit of big-ticket goals like strength, endurance, lung capacity and body composition is flexibility.


RoboChemist: Long-Horizon and Safety-Compliant Robotic Chemical Experimentation

Zhang, Zongzheng, Yue, Chenghao, Xu, Haobo, Liao, Minwen, Qi, Xianglin, Gao, Huan-ang, Wang, Ziwei, Zhao, Hao

arXiv.org Artificial Intelligence

Robotic chemists promise to both liberate human experts from repetitive tasks and accelerate scientific discovery, yet remain in their infancy. Chemical experiments involve long-horizon procedures over hazardous and deformable substances, where success requires not only task completion but also strict compliance with experimental norms. To address these challenges, we propose \textit{RoboChemist}, a dual-loop framework that integrates Vision-Language Models (VLMs) with Vision-Language-Action (VLA) models. Unlike prior VLM-based systems (e.g., VoxPoser, ReKep) that rely on depth perception and struggle with transparent labware, and existing VLA systems (e.g., RDT, $π_0$) that lack semantic-level feedback for complex tasks, our method leverages a VLM to serve as (1) a planner to decompose tasks into primitive actions, (2) a visual prompt generator to guide VLA models, and (3) a monitor to assess task success and regulatory compliance. Notably, we introduce a VLA interface that accepts image-based visual targets from the VLM, enabling precise, goal-conditioned control. Our system successfully executes both primitive actions and complete multi-step chemistry protocols. Results show a 23.57% higher average success rate and a 0.298 average increase in compliance rate over state-of-the-art VLA baselines, while also demonstrating strong generalization to objects and tasks.


From Classical Data to Quantum Advantage -- Quantum Policy Evaluation on Quantum Hardware

Hein, Daniel, Wiedemann, Simon, Baumann, Markus, Felbinger, Patrik, Klein, Justin, Schieder, Maximilian, Stein, Jonas, Schuman, Daniëlle, Cope, Thomas, Udluft, Steffen

arXiv.org Artificial Intelligence

Quantum policy evaluation (QPE) is a reinforcement learning (RL) algorithm which is quadratically more efficient than an analogous classical Monte Carlo estimation. It makes use of a direct quantum mechanical realization of a finite Markov decision process, in which the agent and the environment are modeled by unitary operators and exchange states, actions, and rewards in superposition. Previously, the quantum environment had been implemented and parametrized manually for an illustrative benchmark using a quantum simulator. In this paper, we demonstrate how these environment parameters can be learned from a batch of classical observational data through quantum machine learning (QML) on quantum hardware. The learned quantum environment is then applied in QPE to compute policy evaluations on quantum hardware as well. Our experiments reveal that, despite challenges such as noise and short coherence times, the integration of QML and QPE shows promising potential for achieving quantum advantage in RL.
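For context, the classical baseline that QPE improves on quadratically is plain Monte Carlo policy evaluation: average discounted returns over sampled episodes, with error shrinking as O(1/√n). The two-state MDP, policy, and reward below are toy assumptions for illustration, not the paper's benchmark:

```python
import random

def rollout(rng, horizon=3, p_stay=0.7, gamma=0.9):
    """Discounted return of one episode under a fixed (implicit) policy."""
    state, ret = 0, 0.0
    for t in range(horizon):
        reward = 1.0 if state == 0 else 0.0     # reward only in the 'good' state
        ret += (gamma ** t) * reward
        # Two-state chain: stay with prob p_stay, else flip.
        state = state if rng.random() < p_stay else 1 - state
    return ret

rng = random.Random(42)
n = 10_000
estimate = sum(rollout(rng) for _ in range(n)) / n
# Monte Carlo needs O(1/eps^2) samples for accuracy eps; amplitude-estimation-
# based QPE needs O(1/eps) oracle calls -- the quadratic speedup.
print(estimate)
```

Here the exact value is 1 + 0.9·0.7 + 0.81·(0.7² + 0.3²) = 2.0998, so the estimate should land close to 2.1.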


He worked with artificial limbs for decades. Then a lorry ripped off his right arm. What happened when the expert became the patient?

The Guardian

When the air ambulance brought Jim Ashworth-Beaumont to King's College hospital in south-east London, nobody thought he had a hope. He had been cycling home when a lorry driver failed to spot him alongside his trailer while turning left after a set of traffic lights. The vehicle's wheels opened his torso like a sardine tin, puncturing his lungs and splitting his liver in two. They also tore off his right arm. Weeks after the accident, in July 2020, Ashworth-Beaumont would see a photo of the severed limb taken by a doctor while it lay beside him in hospital. He had asked to see the picture and says it helped him come to terms with his loss. "My hand didn't look too bad," he says. "It was as if it was waving goodbye to me." Ashworth-Beaumont, a super-fit and sunny former Royal Marine from Edinburgh, would go on to spend six weeks in an induced coma as surgeons raced to repair his crushed body. But as he lay on the road, waiting for the paramedics, his only thoughts were that he was dying.


RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

Mu, Yao, Chen, Tianxing, Chen, Zanxin, Peng, Shijia, Lan, Zhiqian, Gao, Zeyu, Liang, Zhixuan, Yu, Qiaojun, Zou, Yude, Xu, Mingkun, Lin, Lunkai, Xie, Zhiqiang, Ding, Mingyu, Luo, Ping

arXiv.org Artificial Intelligence

In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems. However, the scarcity of diverse, high-quality demonstration data and real-world-aligned evaluation benchmarks severely limits such development. To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks. Specifically, RoboTwin creates varied digital twins of objects from single 2D images, generating realistic and interactive scenarios. It also introduces a spatial relation-aware code generation framework that combines object annotations with large language models to break down tasks, determine spatial constraints, and generate precise robotic movement code. Our framework offers a comprehensive benchmark with both simulated and real-world data, enabling standardized evaluation and better alignment between simulated training and real-world performance. We validated our approach using the open-source COBOT Magic Robot platform. Policies pre-trained on RoboTwin-generated data and fine-tuned with limited real-world samples demonstrate significant potential for enhancing dual-arm robotic manipulation systems by improving success rates by over 70% for single-arm tasks and over 40% for dual-arm tasks compared to models trained solely on real-world data.
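The spatial relation-aware decomposition step can be caricatured as: read annotated object poses, resolve a spatial constraint (here, which arm can reach which object), and emit a primitive sequence. Object names, the constraint, and the primitive vocabulary are illustrative assumptions, not RoboTwin's generated code:

```python
def decompose(task, poses):
    """Map a pick-and-place task to arm-tagged primitives for a dual-arm robot."""
    src, dst = poses[task["pick"]], poses[task["place"]]
    # Toy spatial constraint: assign each object to the nearer arm (x < 0 -> left).
    pick_arm = "left" if src[0] < 0 else "right"
    place_arm = "left" if dst[0] < 0 else "right"
    steps = [(pick_arm, "grasp", task["pick"])]
    if pick_arm != place_arm:
        # Cross-body transfer requires explicit dual-arm coordination.
        steps.append((pick_arm, "handover", place_arm))
    steps.append((place_arm, "place", task["place"]))
    return steps

poses = {"mug": (-0.3, 0.2), "tray": (0.25, 0.1)}
plan = decompose({"pick": "mug", "place": "tray"}, poses)
print(plan)  # grasp with left, hand over, place with right
```

In the actual framework an LLM produces this kind of constraint-respecting movement code from object annotations, rather than a hand-written rule.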


COMBO-Grasp: Learning Constraint-Based Manipulation for Bimanual Occluded Grasping

Yamada, Jun, Mitchell, Alexander L., Collins, Jack, Posner, Ingmar

arXiv.org Artificial Intelligence

This paper addresses the challenge of occluded robot grasping, i.e. grasping in situations where the desired grasp poses are kinematically infeasible due to environmental constraints such as surface collisions. Traditional robot manipulation approaches struggle with the complexity of non-prehensile or bimanual strategies commonly used by humans in these circumstances. State-of-the-art reinforcement learning (RL) methods are unsuitable due to the inherent complexity of the task. In contrast, learning from demonstration requires collecting a significant number of expert demonstrations, which is often infeasible. Instead, inspired by human bimanual manipulation strategies, where two hands coordinate to stabilise and reorient objects, we focus on a bimanual robotic setup to tackle this challenge. In particular, we introduce Constraint-based Manipulation for Bimanual Occluded Grasping (COMBO-Grasp), a learning-based approach which leverages two coordinated policies: a constraint policy trained using self-supervised datasets to generate stabilising poses and a grasping policy trained using RL that reorients and grasps the target object. A key contribution lies in value function-guided policy coordination. Specifically, during RL training for the grasping policy, the constraint policy's output is refined through gradients from a jointly trained value function, improving bimanual coordination and task performance. Lastly, COMBO-Grasp employs teacher-student policy distillation to effectively deploy point cloud-based policies in real-world environments. Empirical evaluations demonstrate that COMBO-Grasp significantly improves task success rates compared to competitive baseline approaches, with successful generalisation to unseen objects in both simulated and real-world environments.
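The value function-guided coordination can be illustrated with a toy version of the refinement step: the constraint policy's proposed pose is nudged up the gradient of a value function. The quadratic value, step size, and finite-difference gradient are assumptions for illustration; COMBO-Grasp uses a learned critic and backpropagates through it:

```python
def value(pose, target=(0.5, -0.2)):
    """Toy value: higher when the stabilising pose is near a high-value target."""
    return -sum((p - t) ** 2 for p, t in zip(pose, target))

def refine(pose, steps=50, lr=0.1, eps=1e-4):
    """Gradient-ascent refinement of the proposed pose, via finite differences."""
    pose = list(pose)
    for _ in range(steps):
        for i in range(len(pose)):
            bumped = pose.copy()
            bumped[i] += eps
            grad_i = (value(bumped) - value(pose)) / eps  # d(value)/d(pose_i)
            pose[i] += lr * grad_i
    return pose

proposal = [0.0, 0.0]        # constraint policy's initial output
refined = refine(proposal)   # converges toward the high-value region
print(refined)
```

The point of the mechanism is that the constraint policy's output is not fixed during RL training of the grasping policy; the jointly trained value function keeps pulling it toward poses that make the grasp succeed.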


Review for NeurIPS paper: Counterfactual Data Augmentation using Locally Factored Dynamics

Neural Information Processing Systems

Reviewers were positive and excited about the paper, and I agree with the general sentiment that the work is a significant step in the right direction. Having said that, there are some issues that I would like to see fixed to make its final version more comfortable to read, sound, consistent, and well-positioned with respect to the broader literature. Towards this goal, first, read the reviews carefully and try to incorporate their feedback as much as you can. I will list some critical issues below, mostly in addition to the ones raised by the reviewers. Please re-define the causal model to account for the bipartite structure mentioned in the rebuttal; that's a strong constraint over the SCM-space but appears to be enough for the paper's purposes.


A Concise Mathematical Description of Active Inference in Discrete Time

van Oostrum, Jesse, Langer, Carlotta, Ay, Nihat

arXiv.org Artificial Intelligence

Active inference is a theory that describes the behavior (action selection mechanism) of an agent in an environment. We aim to present a concise mathematical description of the theory so that a reader interested in the mathematical details can quickly find what they are looking for. We have paid special attention to choosing notation that is more in line with standard mathematical texts and is also descriptive, in the sense that dependencies are made explicit. The aim of this paper is not to justify the theory or convince the reader that it is the right theory. The paper is divided into a main text and an appendix. The main text aims to present a clear and simple picture of active inference in discrete time that is accessible to people new to the topic. It is further subdivided into an inference part, which assumes the existence of a generative model; a learning part, in which we discuss how the agent can learn this model; and an example, illustrating the action selection mechanism. In the appendix the more subtle details and derivations are discussed. This part is aimed at people who have already studied the active inference literature but struggle to make sense of the mathematical details.
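As a pointer to the kind of expression the paper formalizes, the action selection mechanism in discrete-time active inference is commonly written via the expected free energy of a policy $\pi$ (this is one standard notation; the paper's own, dependency-explicit notation may differ):

$$G(\pi) = \sum_{\tau} \mathbb{E}_{Q(o_\tau, s_\tau \mid \pi)}\!\left[\ln Q(s_\tau \mid \pi) - \ln P(o_\tau, s_\tau)\right], \qquad P(\pi) \propto \exp\!\left(-\gamma\, G(\pi)\right),$$

where $Q$ is the agent's approximate posterior under policy $\pi$, $P$ is the generative model over observations $o_\tau$ and hidden states $s_\tau$, and $\gamma$ is a precision parameter; policies with lower expected free energy are selected with higher probability.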